## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
## [1] "initializing data.frame (2nt) ..."
## user system elapsed
## 18.120 1.259 26.473
## [1] "counting footprints (2nt) ..."
## user system elapsed
## 0.625 1.135 52.480
## [1] "initializing data.frame (3nt) ..."
## user system elapsed
## 18.782 1.297 27.078
## [1] "counting footprints (3nt) ..."
## user system elapsed
## 0.523 0.849 48.063
footprint count ~ exp(transcript + Asite + Psite + Esite + 5’ digest length + 3’ digest length + 5’ bias sequence + 3’ bias sequence)
distributions for footprint count:
poisson: \(var(x) = E[x]\)
quasi-poisson: \(var(x) = \phi \cdot E[x]\)
negative binomial: \(var(x) = E[x] + \frac{(E[x])^2}{\phi}\)
bias parameters: modeled as 2nt or 3nt sequences
Model predictions better recapitulate original data if we model recovery biases with sequences that are 3 nucleotides in length (versus 2 nucleotides).
Not much visual difference between GLM families (i.e. poisson vs. quasi-poisson vs. negative binomial).
Residuals are much smaller in magnitude with 3nt bias sequence models.
Not much visual difference between GLM families. Models tend to fail in similar places and similar directions (i.e. only vary in magnitude).
| fit | deviance | AIC | dispersion | total_error |
|---|---|---|---|---|
| poisson (2nt) | 3869944.6 | 3939067.4 | 1.00000 | 3915786 |
| quasipois (2nt) | 3869944.6 | NA | 262.72503 | 3915786 |
| negbin (2nt) | 13960.9 | 134116.0 | 1.00000 | 4537915 |
| poisson (3nt) | 914188.5 | 983501.3 | 1.00000 | 1532132 |
| quasipois (3nt) | 914188.5 | NA | 49.05492 | 1532132 |
| negbin (3nt) | 13448.1 | 124122.4 | 1.00000 | 1661988 |
## [1] "testing... poisson vs. negative binomial regression (2nt)"
## Likelihood ratio test of H0: Poisson, as restricted NB model:
## n.b., the distribution of the test-statistic under H0 is non-standard
## e.g., see help(odTest) for details/references
##
## Critical value of test statistic at the alpha= 0.05 level: 2.7055
## Chi-Square Test Statistic = 3804953.4501 p-value = < 2.2e-16
## [1] "testing... poisson vs. negative binomial regression (3nt)"
## Likelihood ratio test of H0: Poisson, as restricted NB model:
## n.b., the distribution of the test-statistic under H0 is non-standard
## e.g., see help(odTest) for details/references
##
## Critical value of test statistic at the alpha= 0.05 level: 2.7055
## Chi-Square Test Statistic = 859380.9567 p-value = < 2.2e-16
3nt models much lower deviance and total error than 2nt models
Negative binomial models much lower deviance than poisson and quasi-poisson models, but moderately higher total error (also demonstrated by residual plots above)
Likelihood ratio test indicate negative binomial regressions fit data much better than poisson regressions
Not much visual difference between poisson and quasi-poisson models
Residuals from negative binomial models tend to have wider range, but more observations at 0
Each point represents a (transcript, codon, 5’ digest length, 3’ digest length)
Mean is model prediction, variance is square of residual
Residuals from 2nt models very over-dispersed, not quite captured by quasi-poisson or negative binomial models
Dispersion better captured by negative binomial in 3nt models
5’ bias sequence: models predict poorly on AC, AT, and TC sequences
Otherwise, pretty good prediction (i.e. residuals mostly 0) on other sequences
Magnitude of residuals doesn’t tend to correlate with number of observations or footprint counts
After removing outliers, good correlation between regression parameters and simulation probabilities
Without removing outliers, not much difference in correlations between 2nt and 3nt models
After removing outliers, correlations improved in 3nt models
Data were simulated such that ribosome positions only depend on A site codon identity
Regression coefficients for E and P site codons should be 0
Among 2nt models, negative binomial regression resulted in more E and P site regression coefficients at 0
All 3nt models tend to have E and P site regression coefficients around 0
Regression coefficients for negative binomial model tighter around 0
## [1] "poisson : TAA,TAG,TGA"
## [1] "quasipois : TAA,TAG,TGA"
## [1] "negbin : TAA,TAG,TGA"
2nt: plotted 4 simulation parameters per bias sequence (i.e. for AA: plotted AAA, AAT, AAC, AAG)
Mostly linear correlation between regression coefficient and simulation probability
3nt: consistently 3 outliers (TAA, TAG, TGA)
Otherwise, good linear fit between regression coefficient and simulation probability
Outliers worse in negative binomial fit
Number of observations: TAA (441), TAG (171), TGA (1476)
Regression coefficients tend to be 0
Need to go back into simulation code to check for errors
set all 5’ and 3’ bias sequences to the reference level (AAA)
predict footprint counts with modified data
codon correlation plots to evaluate bias correction
collapse to counts per codon
scale footprint counts: divide raw counts by average across transcript
full model: -7 to +5 codons; plus 13 leave-one-out models
compute Pearson correlations btwn true & predicted scaled counts